SUGI 27: Discriminant Analysis, A Powerful Classification Technique in Data Mining
Abstract
Data mining is a collection of analytical techniques used to uncover new trends and patterns in massive databases. These techniques stress visualization to study the structure of the data thoroughly and to check the validity of the fitted statistical model, which leads to proactive decision making. Discriminant analysis is a data mining technique that uses multiple attributes to discriminate among the levels of a single classification variable. It also assigns observations to one of several pre-defined groups based on those attributes. When the distribution within each group is multivariate normal, a parametric method can be used to develop a discriminant function based on a generalized squared distance measure. The classification criterion is derived from either the individual within-group covariance matrices or the pooled covariance matrix, and it also takes the prior probabilities of the classes into account. Non-parametric discriminant methods are based on non-parametric estimates of the group-specific probability densities. Either a kernel method or the k-nearest-neighbor method can be used to generate a non-parametric density estimate in each group and to produce a classification criterion. The performance of a discriminant criterion can be evaluated by estimating the probabilities of misclassification of new observations in validation data. A user-friendly SAS application that uses a SAS macro to perform discriminant analysis is presented here. The Car93 data, which contain multiple attributes, are used to demonstrate the features of discriminant analysis in discriminating among three price groups: “LOW”, “MOD”, and “HIGH”.

INTRODUCTION

Data mining is the process of selecting, exploring, and modeling large amounts of data to uncover new trends and patterns in massive databases. These analyses lead to proactive decision making and knowledge discovery in large databases by stressing data exploration to study the structure of the data thoroughly and to check the validity of the fitted statistical models. Discriminant analysis (DA), a multivariate statistical technique, is commonly used to build a predictive or descriptive model of group discrimination based on observed predictor variables and to classify each observation into one of the groups. In DA, multiple quantitative attributes are used to discriminate a single classification variable. DA differs from cluster analysis in that prior knowledge of the classes, usually in the form of a sample from each class, is required. The common objectives of DA are (i) to investigate differences between groups; (ii) to discriminate groups effectively; (iii) to identify important discriminating variables; (iv) to perform hypothesis testing on the differences between the expected groupings; and (v) to classify new observations into pre-existing groups. Stepwise, canonical, and discriminant function analyses are commonly used DA techniques available in the SAS System's STAT module [SAS Inst. Inc. 1999]. The CAR93 data, containing the attributes number of cylinders (X2), horsepower (X4), car width (X11), and car weight (X15), are used here to demonstrate the features of discriminant analysis in classifying three price groups: “LOW (2)”, “MOD (3)”, and “HIGH (1)”. A user-friendly SAS macro developed by the author, which uses the latest capabilities of the SAS System to perform stepwise, canonical, and discriminant function analysis with data exploration, is presented here. Users can perform discriminant analysis on their own data by following the instructions given in the appendix and by downloading the SAS macro-call file from the author’s home page at http://www.ag.unr.edu/gf.

SUGI 27: Statistics and Data Analysis
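For reference, the generalized squared distance used by the parametric method is commonly written in the form below. This is the standard textbook form, as also given in the SAS PROC DISCRIM documentation, not a formula quoted from this paper. The symbols are notation assumed here: $x$ is an observation, $m_t$ the mean vector of group $t$, $S_t$ the within-group covariance matrix, $S_p$ the pooled covariance matrix, and $q_t$ the prior probability of group $t$.

```latex
% Generalized squared distance from observation x to group t
D_t^2(x) = (x - m_t)^{\top} V_t^{-1} (x - m_t) + g_1(t) + g_2(t)

% V_t and g_1 depend on which covariance matrix is used
V_t = S_t,\quad g_1(t) = \ln\lvert S_t \rvert
  \qquad \text{(individual within-group covariance matrices)}
V_t = S_p,\quad g_1(t) = 0
  \qquad \text{(pooled covariance matrix)}

% g_2 carries the prior probabilities
g_2(t) = -2 \ln q_t \ \text{if the priors are unequal}, \qquad
g_2(t) = 0 \ \text{if the priors are equal}

% Posterior probability of membership in group t
p(t \mid x) = \frac{\exp\!\left(-\tfrac{1}{2} D_t^2(x)\right)}
                   {\sum_{u} \exp\!\left(-\tfrac{1}{2} D_u^2(x)\right)}
```

An observation is assigned to the group with the smallest $D_t^2(x)$, which is equivalent to the largest posterior probability.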
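The parametric classification rule and its evaluation on validation data can be illustrated with a minimal sketch. It is written in Python, not SAS, and is not the author's macro. It assumes two predictors, a pooled covariance matrix, and priors proportional to group size. The training and validation values are invented toy numbers (loosely horsepower and weight in thousands of pounds), not the Car93 data, and all function names are hypothetical.

```python
import math

# Toy training data (invented for illustration, NOT the Car93 data).
TRAIN = {
    "LOW":  [(80, 2.0), (90, 2.2), (100, 2.1), (85, 2.3)],
    "MOD":  [(140, 2.9), (150, 3.1), (160, 3.0), (145, 3.2)],
    "HIGH": [(220, 3.8), (230, 4.0), (240, 3.9), (225, 4.1)],
}
# Toy validation data: (observation, true group).
VALID = [((95, 2.2), "LOW"), ((155, 3.0), "MOD"),
         ((235, 4.0), "HIGH"), ((120, 2.5), "MOD")]


def mean_vector(rows):
    """Mean of each of the two predictors."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter(rows, mean):
    """2x2 sum of outer products of the centered observations."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fit_pooled(train):
    """Estimate group means, inverse pooled covariance, and proportional priors."""
    total = sum(len(rows) for rows in train.values())
    means = {g: mean_vector(rows) for g, rows in train.items()}
    pooled = [[0.0, 0.0], [0.0, 0.0]]
    for g, rows in train.items():
        s = scatter(rows, means[g])
        for i in range(2):
            for j in range(2):
                pooled[i][j] += s[i][j]
    dof = total - len(train)  # pooled degrees of freedom: n minus number of groups
    pooled = [[v / dof for v in row] for row in pooled]
    det = pooled[0][0] * pooled[1][1] - pooled[0][1] * pooled[1][0]
    inv = [[pooled[1][1] / det, -pooled[0][1] / det],
           [-pooled[1][0] / det, pooled[0][0] / det]]
    priors = {g: len(rows) / total for g, rows in train.items()}
    return means, inv, priors


def generalized_sq_distance(x, mean, inv, prior):
    """Mahalanobis distance under the pooled covariance, minus 2*ln(prior)."""
    d = [x[0] - mean[0], x[1] - mean[1]]
    maha = sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
    return maha - 2.0 * math.log(prior)


def classify(x, model):
    """Assign x to the group with the smallest generalized squared distance."""
    means, inv, priors = model
    return min(means, key=lambda g: generalized_sq_distance(x, means[g], inv, priors[g]))


def misclassification_rate(valid, model):
    """Share of validation observations assigned to the wrong group."""
    errors = sum(1 for x, g in valid if classify(x, model) != g)
    return errors / len(valid)


model = fit_pooled(TRAIN)
rate = misclassification_rate(VALID, model)
print("validation misclassification rate:", rate)
```

With the pooled covariance matrix the log-determinant term is the same for every group, so it is omitted. A version based on individual within-group covariance matrices would invert each group's own matrix and add its log-determinant to the distance.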
Similar resources
Multi-Group Classification Using Interval Linear Programming
Among the various statistical and data mining discriminant analysis methods proposed so far for group classification, linear programming discriminant analysis has recently attracted researchers’ interest. This study evaluates multi-group discriminant linear programming (MDLP) for classification problems against well-known methods such as neural networks and support vector machines. MDLP is less compli...
Feature reduction of hyperspectral images: Discriminant analysis and the first principal component
When the number of training samples is limited, feature reduction plays an important role in classification of hyperspectral images. In this paper, we propose a supervised feature extraction method based on discriminant analysis (DA) which uses the first principal component (PC1) to weight the scatter matrices. The proposed method, called DA-PC1, copes with the small sample size problem and has...
Feature selection using genetic algorithm for classification of schizophrenia using fMRI data
In this paper we propose a new method for classification of subjects into schizophrenia and control groups using functional magnetic resonance imaging (fMRI) data. In the preprocessing step, the number of fMRI time points is reduced using principal component analysis (PCA). Then, independent component analysis (ICA) is used for further data analysis. It estimates independent components (ICs) of...
SUGI 27: To Neural or Not to Neural? -- This Is the Question
Adding new lines of business and new services for existing customers is particularly important in the deregulated telecommunications industry. Attracting new services to existing customers often translates into happier customers, increased retention of profitable customers, and competitive advantage. Identification of the most profitable customers (who are the most probable buyers of additional or n...
Publication date: 2002